CRM114 (program)

CRM114 (full name: "The CRM114 Discriminator") is a program that uses a statistical approach to classify data; it is used especially for filtering email spam.


Origin of the name

The name comes from the CRM-114 Discriminator in the Stanley Kubrick movie Dr. Strangelove, a piece of radio equipment designed to filter out messages lacking a specific code prefix.


Operation

While others have done statistical Bayesian spam filtering based upon the frequency of single-word occurrences in email, CRM114 achieves a higher rate of spam recognition by creating hits based upon phrases up to five words in length. These phrases are used to form a Markov random field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis gave 99.87% accuracy; Holden (''Spam Filtering II'') and the TREC 2005 and 2006 spam tracks (''Spam Track Overview'', 2005 and 2006) gave results of better than 99%, with significant variation depending on the particular corpus.

CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant on KNN (k-nearest neighbor) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, an SVM, mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of skip-grams. The CRM114 algorithms are multi-lingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%. CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.

At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of the reason is that the CRM114 language syntax is not positional but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on absolutely identical strings matching to function correctly.

CRM114 has been applied to email filtering in the KMail client and to a number of other applications, including detection of bots on Twitter and Yahoo, as well as the first-level filter in the US Department of Transportation's vehicle defect detection system. It has also been used as a predictive method for classifying fault-prone software modules.
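
The phrase-based features described earlier in this section (phrases of up to five words, generalizing skip-grams) can be illustrated with a minimal Python sketch. It is not CRM114's actual code: the function name sparse_phrases, the "<skip>" placeholder, and the choice to anchor every phrase on the newest word are assumptions made purely for illustration.

```python
SKIP = "<skip>"  # hypothetical placeholder for a skipped word


def sparse_phrases(tokens, window=5):
    """Return phrase features of up to `window` words, allowing skipped words.

    Every feature ends at the current token and keeps some subset of the
    preceding (window - 1) tokens; positions that are left out in the
    middle of a phrase are replaced by SKIP.
    """
    features = []
    for i, tok in enumerate(tokens):
        prev = tokens[max(0, i - (window - 1)):i]
        n = len(prev)
        # Each bit of `mask` decides whether one preceding token is kept.
        for mask in range(2 ** n):
            kept = [prev[j] if mask & (1 << j) else SKIP for j in range(n)]
            # Leading skips carry no information, so drop them.
            while kept and kept[0] == SKIP:
                kept.pop(0)
            features.append(" ".join(kept + [tok]))
    return features


print(sparse_phrases("buy cheap pills".split()))
# ['buy', 'cheap', 'buy cheap', 'pills', 'buy <skip> pills',
#  'cheap pills', 'buy cheap pills']
```

A single-word filter would see only "buy", "cheap" and "pills" here; the additional features record which words occur together and in what order, which is the kind of context the phrase-based approach relies on. With a full five-word window, each new word contributes up to 16 features in this sketch.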


See also

* String matching


External links


The CRM114 home page on SourceForge

The TRE approximate regex matcher homepage